QVAC-19778 fix[api]: finish_reason=length + unified token accounting across chat-category routes by lauripiisang · Pull Request #2477 · tetherto/qvac

lauripiisang · 2026-06-08T09:02:51Z

🎯 What problem does this PR solve?

finish_reason was hardcoded to "stop" on all chat-category routes, even when generation was cut off by max_tokens (should be "length").
completion_tokens was counted three different ways: via the stats helper in /v1/chat/completions, a whitespace split in /v1/completions (blocking), and an SSE-event counter in /v1/completions (streaming). The Responses route had its own inline fallback, so four code paths could report different counts for the same generation.
/v1/videos* endpoints were missing from both the in-repo and marketing-site docs.

📝 How does it solve it?

drainCompletion(result, onToken?) in adapters/openai/completion-result.ts consumes the SDK result.events stream once and returns { text, toolCalls, stats, stopReason, completionTokens, finishReason }.
Finish-reason precedence: tool_calls > length > stop. All three chat routes (chat, completions, responses) now call drainCompletion.
Responses API: stopReason === "length" maps to status: "incomplete" + incomplete_details.reason: "max_output_tokens"; the streaming path emits response.incomplete instead of response.completed.
Docs: added /v1/videos* to packages/cli/docs/serve-openai.md and docs/website/content/docs/cli/http-server/index.mdx.
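
The single-pass drain and the precedence rule can be sketched roughly as follows. This is an illustration, not the code in completion-result.ts: the event shapes (`token`, `toolCall`, `completionDone`), the `SketchResult` type and the helper names are assumptions made for the example. Only the returned field names, the use of `stats.generatedTokens` with a whitespace-split fallback, and the tool_calls > length > stop precedence come from the description above.

```typescript
// Hypothetical event shapes; the real SDK types may differ.
type SketchEvent =
  | { type: "token"; text: string }
  | { type: "toolCall"; name: string; arguments: string }
  | { type: "completionDone"; stopReason: string; stats?: { generatedTokens?: number } };

interface SketchResult {
  events: AsyncIterable<SketchEvent>;
}

type FinishReason = "tool_calls" | "length" | "stop";

interface Drained {
  text: string;
  toolCalls: { name: string; arguments: string }[];
  stats: { generatedTokens?: number } | undefined;
  stopReason: string | undefined;
  completionTokens: number;
  finishReason: FinishReason;
}

// Precedence: tool_calls > length > stop.
function deriveFinishReason(toolCallCount: number, stopReason: string | undefined): FinishReason {
  if (toolCallCount > 0) return "tool_calls";
  if (stopReason === "length") return "length";
  return "stop";
}

// Fallback used only when the stats carry no generatedTokens value.
function whitespaceTokenCount(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

async function drainCompletionSketch(
  result: SketchResult,
  onToken?: (text: string) => void,
): Promise<Drained> {
  let text = "";
  const toolCalls: Drained["toolCalls"] = [];
  let stats: Drained["stats"];
  let stopReason: string | undefined;

  // Consume the event stream exactly once.
  for await (const ev of result.events) {
    if (ev.type === "token") {
      text += ev.text;
      onToken?.(ev.text);
    } else if (ev.type === "toolCall") {
      toolCalls.push({ name: ev.name, arguments: ev.arguments });
    } else {
      stats = ev.stats;
      stopReason = ev.stopReason;
    }
  }

  return {
    text,
    toolCalls,
    stats,
    stopReason,
    completionTokens: stats?.generatedTokens ?? whitespaceTokenCount(text),
    finishReason: deriveFinishReason(toolCalls.length, stopReason),
  };
}
```

Because every route reads `completionTokens` and `finishReason` from the same return value, the blocking and streaming paths cannot drift apart the way the per-route counters could.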

The two other QVAC-19778 follow-up items needed no code change — verified against the current tree:

Streaming cancellation: cancel bridge re-homed by the Fastify rewrite (QVAC-19179 feat[bc]: rewrite serve HTTP layer on Fastify + Zod #2306); every inference route binds req.on('close') → cancel({ requestId }).
Route preamble DRY: resolveAndCheckModel() / requireModel() (added in QVAC-19179 feat[bc]: rewrite serve HTTP layer on Fastify + Zod #2306) already centralise the preamble across all routes.
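
For readers unfamiliar with the cancel bridge, the binding described above has roughly this shape. The sketch is not taken from the Fastify layer in #2306: `bindCancelOnClose`, `ClosableRequest`, `CancelFn` and the `markFinished` guard are hypothetical names introduced for the example, and only the `req.on('close')` → `cancel({ requestId })` wiring is from the PR text.

```typescript
// Minimal structural type: any object exposing on("close", listener) fits.
interface ClosableRequest {
  on(event: "close", listener: () => void): unknown;
}

// Hypothetical cancel signature, modelled on the cancel({ requestId }) call above.
type CancelFn = (args: { requestId: string }) => void | Promise<void>;

function bindCancelOnClose(
  req: ClosableRequest,
  requestId: string,
  cancel: CancelFn,
): { markFinished: () => void } {
  let finished = false;

  req.on("close", () => {
    // "close" can also fire once a request completes normally, so skip
    // requests that were already marked finished (an assumption of this sketch).
    if (!finished) void cancel({ requestId });
  });

  return {
    markFinished: () => {
      finished = true;
    },
  };
}
```

A route handler would call `markFinished()` after writing its final response, so that only genuine client disconnects reach `cancel`.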

🧪 How was it tested?

New test/completion-result.test.ts: 20 cases covering eos, length, stopSequence, tool_calls finish-reason branches; stats.generatedTokens vs whitespace-fallback token counting; streaming onToken callback ordering.
Extended test/responses.test.ts and test/responses-streaming.test.ts: assert status: "incomplete" + incomplete_details on max_tokens truncation.
New e2e cases in test/e2e.bats against a real loaded LLM: max_tokens: 1 with a long-output prompt, asserting finish_reason == "length" and usage.completion_tokens == 1 for both blocking and streaming paths.
All 374 CLI unit tests pass (bun run test:unit). Lint and typecheck clean (bun run lint).
All 134 e2e tests pass (npm run test:e2e, Qwen3-600M). Six max_tokens values were bumped from 8–24 to 512 in e2e.bats for tests that asserted finish_reason: "stop". The old budgets were too tight for Qwen3's chain-of-thought <think> preamble, so those assertions failed once the fix correctly returned "length" on budget exhaustion.

🔌 API Changes

// finish_reason now reflects truncation
{ finish_reason: "length" }  // when max_tokens cuts generation short (was always "stop")

// Responses API — length truncation (blocking)
{ status: "incomplete", incomplete_details: { reason: "max_output_tokens" } }
// Responses API — length truncation (streaming)
{ type: "response.incomplete" }  // was "response.completed"

// completion_tokens: now consistently uses stats.generatedTokens (with whitespace-split fallback)
// across /v1/chat/completions, /v1/completions, and /v1/responses
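
The Responses mapping above amounts to a small pure function. The sketch below is an assumption-laden illustration: `responseStatusFor` and `terminalEventTypeFor` are hypothetical helper names, and returning `incomplete_details: null` for a completed response follows the OpenAI response object rather than anything stated in this PR.

```typescript
interface ResponseStatusFields {
  status: "completed" | "incomplete";
  incomplete_details: { reason: "max_output_tokens" } | null;
}

// Blocking path: which status fields to put on the response object.
function responseStatusFor(stopReason: string | undefined): ResponseStatusFields {
  if (stopReason === "length") {
    return { status: "incomplete", incomplete_details: { reason: "max_output_tokens" } };
  }
  return { status: "completed", incomplete_details: null };
}

// Streaming path: which terminal SSE event type to emit.
function terminalEventTypeFor(
  stopReason: string | undefined,
): "response.incomplete" | "response.completed" {
  return stopReason === "length" ? "response.incomplete" : "response.completed";
}
```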

🔗 Dependencies

Depends on #2484 — SDK must emit stopReason: "length" before the two known-failing e2e tests (94, 95) pass.

github-actions · 2026-06-08T09:07:10Z

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ✅ APPROVED

**Requirements:**
- 1 Team Member approval ✅ (1/1)
- 1 Team Lead OR Management approval ✅ (1/1)



---
*This comment is automatically updated when reviews change.*

Introduce drainCompletion (completion-result.ts) to consume the SDK event stream once: derives text, stats, stopReason, and finish_reason (tool_calls > length > stop). All three chat-category routes use it; per-route token-counting variants removed. Responses API: length truncation maps to status "incomplete" + incomplete_details.reason "max_output_tokens". Two e2e tests for finish_reason=length are known-failing pending SDK PR #2484 (stopReason="length" not yet emitted by the plugin). Six e2e max_tokens bumped 8-24 → 512 for Qwen3-600M <think> compat. Docs: /v1/videos* added to serve-openai.md and http-server mdx. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

messages[].content now accepts OpenAI array format (text/image_url/input_audio/file parts). Non-text parts are silently dropped; text parts are concatenated. Fixes 400 errors from Cline and Open WebUI. Audio transcription/translation temperature was z.string(), but the OpenAI spec requires a number. Fixes 400 for any client sending temperature:0.0. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
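
The content normalization in this commit can be pictured as below. This is a sketch under assumptions, not the committed code: `flattenContent` and the part types are hypothetical names, and joining text parts with an empty separator is a guess, since the commit message only says they are concatenated.

```typescript
interface TextPart {
  type: "text";
  text: string;
}

// Non-text parts keep whatever payload the client sent; it is ignored here.
interface OtherPart {
  type: "image_url" | "input_audio" | "file";
  [key: string]: unknown;
}

type ContentPart = TextPart | OtherPart;

function flattenContent(content: string | ContentPart[]): string {
  // Plain string content passes through unchanged.
  if (typeof content === "string") return content;

  return content
    .filter((part): part is TextPart => part.type === "text") // drop non-text parts
    .map((part) => part.text)
    .join(""); // separator is an assumption of this sketch
}
```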

Browsers (and any client using b64 MP3) send audio that whisper.cpp cannot decode when piped as a raw buffer. By writing the buffer to a temp file with the correct extension and passing the path string to the SDK, format detection runs on-disk rather than on the buffer. Both /v1/audio/transcriptions and /v1/audio/translations use the new path. The temp file is deleted in a finally block regardless of outcome. transcribeOverride type widened to string | Buffer to match the SDK. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
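
The temp-file approach described in this commit follows a common write, use, delete-in-finally pattern. A minimal sketch, assuming Node's built-in fs/os/path/crypto modules: `withTempAudioFile` is a hypothetical helper, and how the route picks the extension is not stated in the commit, so it is passed in by the caller here.

```typescript
import { promises as fs } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";
import { randomUUID } from "node:crypto";

async function withTempAudioFile<T>(
  audio: Uint8Array,
  extension: string, // e.g. "mp3"; chosen by the caller in this sketch
  run: (path: string) => Promise<T>,
): Promise<T> {
  // A unique name avoids collisions between concurrent requests.
  const path = join(tmpdir(), `audio-${randomUUID()}.${extension}`);
  await fs.writeFile(path, audio);
  try {
    // The callback receives the path string instead of the raw buffer.
    return await run(path);
  } finally {
    // Runs on success and on failure; force ignores an already-missing file.
    await fs.rm(path, { force: true });
  }
}
```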

…oerce drainCompletion: throw HttpError(502, inference_failed) when completionDone carries stopReason=error (the events stream yields this normally, it does not throw). Throw InferenceCancelledError on cancelled by awaiting result.final after the loop, which is already rejected at that point. Previously both cases returned partial 200. Fix JSDoc comment that wrongly claimed result.events throws on errorDone. audio schemas: z.coerce.number() for temperature on transcriptions and translations. multipartToBody stringifies all non-file fields, so z.number() without coerce rejected valid clients with 400, breaking the documented "temperature is ignored" behavior. Tests: add fakeRun cases for errorDone (expects HttpError 502) and cancelledDone (expects InferenceCancelledError via rejecting final promise). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
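
The error and cancellation handling in this commit can be sketched as follows. The classes are local stand-ins named after those in the commit message (their real constructors may differ), and `drainOrThrow` plus the event shapes are hypothetical. What comes from the commit is the behaviour: a terminal event with stopReason "error" becomes a 502 inference_failed, and a cancelled run surfaces by awaiting the already-rejected `final` promise after the loop.

```typescript
// Stand-ins for the error classes named in the commit message.
class HttpError extends Error {
  readonly status: number;
  readonly code: string;
  constructor(status: number, code: string, message: string) {
    super(message);
    this.status = status;
    this.code = code;
  }
}
class InferenceCancelledError extends Error {}

// Hypothetical event shapes for the sketch.
type TerminalSketchEvent =
  | { type: "token"; text: string }
  | { type: "completionDone"; stopReason: string };

async function drainOrThrow(
  events: AsyncIterable<TerminalSketchEvent>,
  final: Promise<unknown>,
): Promise<string> {
  let text = "";
  let stopReason: string | undefined;

  // The stream ends normally even on failure; the loop itself does not throw.
  for await (const ev of events) {
    if (ev.type === "token") text += ev.text;
    else stopReason = ev.stopReason;
  }

  if (stopReason === "error") {
    throw new HttpError(502, "inference_failed", "inference failed before completion");
  }
  if (stopReason === "cancelled") {
    // final is already rejected here; awaiting it rethrows the cancellation.
    await final;
  }
  return text;
}
```

Checking the terminal event after the loop is what keeps a failed or cancelled generation from being returned as a partial 200.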

lauripiisang · 2026-06-09T18:39:46Z

/review

lauripiisang requested review from a team as code owners June 8, 2026 09:02

lauripiisang marked this pull request as draft June 8, 2026 09:06

lauripiisang force-pushed the feat/QVAC-19778-finish-reason-tokens branch from 2ac8659 to e71da37 Compare June 8, 2026 09:12

lauripiisang changed the title ~~QVAC-19778 feat[api]: audio encoding, finish_reason=length, and token accounting for chat-category routes~~ QVAC-19778 fix[api]: finish_reason=length + unified token accounting across chat-category routes Jun 8, 2026

lauripiisang added the tier1 label Jun 8, 2026

lauripiisang force-pushed the feat/QVAC-19778-finish-reason-tokens branch from e71da37 to f4d049c Compare June 8, 2026 09:28

lauripiisang force-pushed the feat/QVAC-19778-finish-reason-tokens branch from 34b780a to 5a79aa6 Compare June 8, 2026 16:25

lauripiisang marked this pull request as ready for review June 8, 2026 16:51

lauripiisang and others added 2 commits June 8, 2026 23:25

opaninakuffo reviewed Jun 9, 2026

Comment thread packages/cli/src/serve/adapters/openai/completion-result.ts Outdated

opaninakuffo reviewed Jun 9, 2026

Comment thread packages/cli/src/serve/adapters/openai/completion-result.ts Outdated

opaninakuffo reviewed Jun 9, 2026

Comment thread packages/cli/src/serve/schemas/audio.ts Outdated

simon-iribarren approved these changes Jun 9, 2026

opaninakuffo approved these changes Jun 9, 2026

Merge branch 'main' into feat/QVAC-19778-finish-reason-tokens

157d19a

lauripiisang merged commit cf85d37 into main Jun 9, 2026
10 of 11 checks passed

lauripiisang deleted the feat/QVAC-19778-finish-reason-tokens branch June 9, 2026 20:36

lauripiisang mentioned this pull request Jun 15, 2026

chore[notask|skiplog]: release @qvac/cli 0.7.0 #2598

Merged

lauripiisang merged 5 commits into main from feat/QVAC-19778-finish-reason-tokens
